Problem Statement: Concrete Strength Prediction¶Objective¶To predict the concrete strength using the data available in file "concrete.csv". Apply feature engineering and model tuning to obtain a score above 85%.
Resources Available¶The data for this project is available in file https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/. The same has been shared along with the course content.
Attribute information¶Given are the variable name, variable type, the measurement unit, and a brief description. The concrete compressive strength is the target of this regression problem. The order of this listing corresponds to the order of the columns in the database.
Name Data Type Measurement Description
Univariate analysis (10 marks)¶Data types and description of the independent attributes, which should include: name, range of values observed, central values (mean and median), standard deviation, quartiles, analysis of the body of the distributions/tails, missing values, outliers, and duplicates. (10 Marks)
Bi-variate analysis (10 marks)¶Analyse the relationship between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using box plots, pair plots, histograms, or density curves.
Feature Engineering techniques (10 marks)¶Identify opportunities (if any) to extract new features from existing features, and drop a feature (if required). Hint: Feature Extraction, for example, consider a dataset with two features, length and breadth. From this, we can extract a new feature Area, which would be length * breadth.
Get the data model ready and do a train test split.
Decide on the complexity of the model: should it be a simple linear model in terms of parameters, or would a quadratic or higher-degree model be better? (10 marks)
Creating the Model and Tuning It (30 marks)¶Algorithms that you think will be suitable for this project. Use Kfold Cross-Validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics. (at least 3 algorithms, one bagging and one boosting based algorithms have to be there). (15 marks)
Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks)
Import all the header libraries¶import os, sys, re
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
# For Plot
import matplotlib.pyplot as plt
import seaborn as sns
# Add nice background to the graphs
sns.set(color_codes=True)
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
import pydotplus
import graphviz
# sklearn libraries
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LogisticRegression
from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
# calculate confusion matrix and various performance metrics
from sklearn.metrics import confusion_matrix, recall_score, precision_score, \
f1_score, accuracy_score, mean_squared_error, r2_score
#AUC ROC curve
from sklearn.metrics import roc_auc_score, roc_curve
# Additional libs
from yellowbrick.classifier import ClassificationReport, ROCAUC
from six import StringIO
from IPython.display import Image
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
Read the input file into pandas dataframe¶# Let's read the data into dataframe
csp_df = pd.read_csv('concrete.csv')
csp_df.head()
1. Univariate analysis (10 marks)¶Data types and description of the independent attributes, which should include: name, range of values observed, central values (mean and median), standard deviation, quartiles, analysis of the body of the distributions/tails, missing values, outliers, and duplicates. (10 Marks)
# print the shape of the dataframe
csp_df.shape
# print the datatypes of each column
csp_df.dtypes
Insights:
There are 9 columns and 1030 rows in the dataset. Looking at the output of dtypes, all columns are of float or int datatype. There are duplicate records and columns containing zeros; I will treat/convert/drop them as needed in the following steps.
# List the total samples, whether any values are null, and the dtype of each column
csp_df.info()
# Check the unique values in each column of the dataframe.
csp_df.nunique()
# Let's find all the duplicate rows in the data frame
csp_df[csp_df.duplicated(keep=False)]
# Let's drop all the duplicate rows from the dataframe
csp_df.drop_duplicates(inplace=True, keep="first")
csp_df.shape
csp_df.info()
NOTE: Dropped the duplicate records, so the shape is reduced to 1005 rows and 9 columns¶csp_df.describe().T
# Cross-check the non-null counts reported by info()
csp_df.isnull().sum()
# Let's replace all the zeros in the columns (slag, ash and superplastic) with the column mean
# Mask 0 as np.nan, then fill with the mean of the remaining non-zero values
zero_cols = ['slag', 'ash', 'superplastic']
csp_df[zero_cols] = csp_df[zero_cols].mask(csp_df[zero_cols] == 0)
csp_df[zero_cols] = csp_df[zero_cols].fillna(csp_df[zero_cols].mean())
csp_df.describe().T
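A minimal sketch of the mask-then-fill idiom used above, on a toy frame (values hypothetical, not the concrete data); note that filling from the mean of the masked frame uses only the non-zero values:

```python
import numpy as np
import pandas as pd

# Toy column where zeros stand in for missing measurements
toy = pd.DataFrame({'ash': [0.0, 10.0, 20.0, 0.0, 30.0]})

# Mask zeros as NaN, then fill with the mean of the remaining non-zero values
masked = toy.mask(toy == 0)
filled = masked.fillna(masked.mean())

print(filled['ash'].tolist())  # zeros replaced by mean(10, 20, 30) = 20.0
```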
Insights:
I found a few duplicate rows and dropped them. Looking at the output of describe(), info(), and isnull(), there are no columns with missing values. A few columns contained zeros, which I replaced with the mean value of the column. I will revisit this section and treat the variables in the data preparation stage.
# List all the columns of the dataframe
csp_df.columns
# Let's visualize the individual data
# and see how do they look. This helps in further decision making about the data
csp_df[csp_df.columns].hist(stacked=False, bins=50, figsize=(30,120), layout=(10,1));
# Let's implement the valueCount function
def valueCount(pd_df=None):
    columns = pd_df.columns
    for col in columns:
        print('value_counts for {}'.format(col))
        print(pd_df[col].value_counts(normalize=True).head(10))
        print()
# Let's print the value counts of all the variables. Let's limit the rows to 10 for better visibility
valueCount(csp_df)
Insights:
Looking at the value_counts() output and the histogram plots, the following observations are made:
1). The data for the slag, ash and superplastic columns are highly skewed and most samples had value 0. Replaced zeros with the mean value.
2). The age column is also skewed towards lower numbers, i.e. there are more samples with age <= 28 days. The most common age value is 28 days, contributing ~41% of the total samples.
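The skew noted above can be quantified rather than eyeballed; a toy sketch with hypothetical values, using pandas' `skew()` (positive means a long right tail):

```python
import pandas as pd

# Right-skewed toy sample: many small values, a few large ones
s = pd.Series([1, 1, 1, 2, 2, 3, 20, 50])

# pandas computes the adjusted Fisher-Pearson skewness estimator
print(s.skew())  # positive for this right-tailed sample
```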
2. Bi-variate analysis (10 marks)¶Analyse the relationship between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using box plots, pair plots, histograms, or density curves.
# NOTE: The pair plot takes a lot of memory and compute power; skip this cell if resources are limited.
# Let's plot pair plots to see the relation between variables
plt.figure(figsize=(20,5))
sns.pairplot(csp_df)
plt.show()
# Let's plot the box plot between age and strength.
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(x=csp_df['age'], y=csp_df['strength']);
plt.show()
# Let's plot the box plot between superplastic and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(x=csp_df['superplastic'], y=csp_df['strength']);
plt.xticks(
rotation=90,
horizontalalignment='center',
fontweight='normal',
fontsize='large'
)
plt.show()
# Let's plot the box plot between fineagg and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(x=csp_df['fineagg'], y=csp_df['strength']);
plt.xticks(
rotation=90,
horizontalalignment='center',
fontweight='normal',
fontsize='large'
)
plt.show()
# Let's plot the box plot between coarseagg and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(x=csp_df['coarseagg'], y=csp_df['strength']);
plt.xticks(
rotation=90,
horizontalalignment='center',
fontweight='normal',
fontsize='large'
)
plt.show()
# Let's plot the box plot between ash and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(x=csp_df['ash'], y=csp_df['strength']);
plt.xticks(
rotation=90,
horizontalalignment='center',
fontweight='normal',
fontsize='large'
)
plt.show()
# Let's plot the box plot between slag and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(x=csp_df['slag'], y=csp_df['strength']);
plt.xticks(
rotation=90,
horizontalalignment='center',
fontweight='normal',
fontsize='large'
)
plt.show()
# Let's plot Strength vs Cement, Water and age
plt.figure(figsize=(30,10))
sns.scatterplot(y="strength", x="cement", hue="water",size="age", data=csp_df, sizes=(50, 300))
# Let's plot Strength vs Fineagg, Ash and Superplastic
plt.figure(figsize=(30,10))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", data=csp_df, sizes=(50, 300))
# Let's see the correlation of the variables
corr = csp_df.corr()
sns.heatmap(corr, annot=True, cmap='Blues')
# Let's look at the mean of each feature grouped by strength
csp_df.groupby('strength').mean()
# pd.crosstab(csp_df['strength'],csp_df['ash'],normalize='index')
Insights:
Looking at the bivariate pair plots and correlation plots, the following observations are made:
1). There is a high positive correlation between Strength and Cement: concrete strength indeed increases with the amount of cement used in preparing it.
2). Age and Superplastic are the other two factors influencing compressive strength. There are other strong correlations between the features: a strong negative correlation between Superplastic and Water, and positive correlations between Superplastic and Ash and between Superplastic and Fine Aggregate.
3). Compressive strength increases as the amount of cement increases
4). Compressive strength increases with age
5). Concrete strength increases when less water is used in preparing it
6). Compressive strength decreases as ash increases
7). Compressive strength increases with Superplasticizer
8). Compressive strength increases with slag
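The per-feature correlations called out above can also be read off programmatically by ranking one column of the correlation matrix; a toy sketch with hypothetical data:

```python
import pandas as pd

# Toy frame: y rises with a, falls with b
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [5, 4, 3, 2, 1],
                   'y': [2, 4, 6, 8, 10]})

# Correlation of every feature with the target, strongest positive first
target_corr = df.corr()['y'].drop('y').sort_values(ascending=False)
print(target_corr)  # a: 1.0 (perfect positive), b: -1.0 (perfect negative)
```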
Feature Engineering techniques (10 marks)¶Identify opportunities (if any) to extract new features from existing features, and drop a feature (if required). Hint: Feature Extraction, for example, consider a dataset with two features, length and breadth. From this, we can extract a new feature Area, which would be length * breadth. Get the data ready for modelling and do a train test split. Decide on the complexity of the model: should it be a simple linear model in terms of parameters, or would a quadratic or higher-degree model be better? (10 marks)
NOTE: All the independent variables are positively or negatively related to strength to a considerable degree, so we will use all the variables in model building¶# Looking at the data, the variables are on different scales. So let's bring all the variables onto the same scale using StandardScaler
ss_scaler = StandardScaler()
# Let's separate the dependent and independent variables into X and Y variables
csp_df_Y = csp_df[['strength']].copy()  # copy() avoids a SettingWithCopyWarning when binning below
csp_df_X = csp_df.drop('strength', axis=1)
# Let's apply Binning to output variable "strength"
# Let's replace numerical values with low, mid and high
csp_df_Y['strength'] = pd.cut(csp_df_Y['strength'], bins=[0,27,54,90], labels=["low", "mid", "high"])
# print(csp_df_Y['strength'])
# Let's convert the string labels to numerical values (low -> 0; mid and high -> 1, making it a binary target)
csp_df_Y = csp_df_Y.replace({"strength":{"low":0, "mid":1, "high":1}})
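The binning and label-encoding steps behave as sketched below on toy strength values (bin edges and labels copied from the cell above; the sample values are hypothetical):

```python
import pandas as pd

vals = pd.Series([10.0, 30.0, 60.0])  # toy strength values

# Same edges/labels as above: (0,27] -> low, (27,54] -> mid, (54,90] -> high
binned = pd.cut(vals, bins=[0, 27, 54, 90], labels=["low", "mid", "high"])
print(binned.tolist())  # ['low', 'mid', 'high']

# low -> 0, mid/high -> 1 collapses the problem to binary classification
encoded = binned.astype(str).replace({"low": 0, "mid": 1, "high": 1})
print(encoded.tolist())  # [0, 1, 1]
```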
# Fit and transform the X variables
csp_df_X_scaled = pd.DataFrame(ss_scaler.fit_transform(csp_df_X), columns=csp_df_X.columns)
# Let's split the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(csp_df_X_scaled, csp_df_Y, test_size=0.30, random_state=11)
X_train.shape
X_test.shape
y_train.shape
y_test.shape
print(X_train.head())
print()
print(y_train.head())
print(X_test.head())
print()
print(y_test.head())
We can fit different models on the training data and compare their performance. As this is a regression problem, for the primary model I will use linear regression and will also apply the Lasso and Ridge algorithms. I will use RMSE (Root Mean Square Error) and R² score as evaluation metrics. I will apply polynomial regression models and will finalize the algorithm with the best performance. Later, I will use more complex algorithms such as Logistic Regression, Decision Tree, RandomForest, bagging and boosting algorithms and compare their performance¶# Linear Regression
lr = LinearRegression()
# Fitting models on Training data
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
print()
print()
# Making predictions on Test data
y_pred_lr = lr.predict(X_test)
# Compute rmse and R2
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
R2_score_lr = r2_score(y_test, y_pred_lr)
print("Linear Regression \t RMSE = {:.2f} \t R2_score = {:.2f}".format(rmse_lr, R2_score_lr))
print()
# Let's apply Lasso Regression Algorithm
lasso = Lasso()
# Fitting models on Training data
lasso.fit(X_train, y_train)
# Making predictions on Test data
y_pred_lasso = lasso.predict(X_test)
# Compute rmse and R2
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
R2_score_lasso = r2_score(y_test, y_pred_lasso)
print("Lasso Regression \t RMSE = {:.2f} \t R2_score = {:.2f}".format(rmse_lasso, R2_score_lasso))
# Let's apply Ridge Regression Algorithm
print()
ridge = Ridge()
# Fitting models on Training data
ridge.fit(X_train, y_train)
# Making predictions on Test data
y_pred_ridge = ridge.predict(X_test)
# Compute rmse and R2
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
R2_score_ridge = r2_score(y_test, y_pred_ridge)
print("Ridge Regression \t RMSE = {:.2f} \t R2_score = {:.2f}".format(rmse_ridge, R2_score_ridge))
NOTE: Looking at the above results, Linear and Ridge regression show very similar behaviour, while the Lasso model's R2_score is noticeably worse. Let's work on a polynomial regression model¶# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
# PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_X_train = poly.fit_transform(X_train)
# print(poly_X_train)
# Fitting models on Training data
poly_regress_model = lr.fit(poly_X_train, y_train)
print(poly_regress_model.score(poly_X_train, y_train))
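With include_bias=False, PolynomialFeatures(degree=3) on this dataset's 8 predictors generates one column per product of up to three features. The resulting width can be verified with a stdlib-only sketch (the 8-feature count is taken from this dataset; no sklearn required):

```python
from itertools import combinations_with_replacement

n_features, degree = 8, 3

# One output column per multiset of input features of size 1..degree
n_outputs = sum(
    len(list(combinations_with_replacement(range(n_features), d)))
    for d in range(1, degree + 1)
)
print(n_outputs)  # 8 + 36 + 120 = 164 polynomial features
```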
Creating the Model and Tuning It (30 marks)¶Algorithms that you think will be suitable for this project. Use Kfold Cross-Validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics. (at least 3 algorithms, one bagging and one boosting based algorithms have to be there). (15 marks)
# Let's create an empty dictionary to store model name and performance matrix
m_perf_matrix = dict()
## Create a function to get confusion matrix in a proper format
def pltConfusionMatrix(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    sns.heatmap(cm, annot=True, fmt='.4f', xticklabels=[0,1], yticklabels=[0,1])
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
# Compute performance metrics
def computePerfMatrix(m_classifier, m_predict, kfold=None):
    # Let's store all the perf values in a dict.
    perf_matrix = dict()
    perf_matrix['training_accuracy'] = m_classifier.score(X_train, y_train)
    print("Training dataset accuracy:", perf_matrix['training_accuracy'])
    print()
    perf_matrix['test_accuracy'] = m_classifier.score(X_test, y_test)
    print("Testing dataset accuracy:", perf_matrix['test_accuracy'])
    print()
    print('Confusion Matrix:')
    pltConfusionMatrix(y_test, m_predict)
    print()
    perf_matrix['recall_score'] = recall_score(y_test, m_predict, average="micro")
    print("Recall:", perf_matrix['recall_score'])
    print()
    perf_matrix['precision_score'] = precision_score(y_test, m_predict, average="micro")
    print("Precision:", perf_matrix['precision_score'])
    print()
    perf_matrix['f1_score'] = f1_score(y_test, m_predict, average="micro")
    print("F1 Score:", perf_matrix['f1_score'])
    print()
    perf_matrix['roc_auc_score'] = roc_auc_score(y_test, m_predict)
    print("Roc Auc Score:", perf_matrix['roc_auc_score'])
    print()
    # Use the provided KFold splitter if given, else default to 10 folds
    cv = kfold if kfold is not None else 10
    perf_matrix['cross_val_score'] = cross_val_score(m_classifier, csp_df_X, csp_df_Y, cv=cv, scoring='roc_auc').mean()
    print("cross_val_score:", perf_matrix['cross_val_score'])
    # Return the model specific results
    return perf_matrix
# Visualize model's performance with yellowbrick library
def visualizePerfMatrix(m_classifier):
    viz = ClassificationReport(m_classifier)
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)
    viz.show()
    roc = ROCAUC(m_classifier)
    roc.fit(X_train, y_train)
    roc.score(X_test, y_test)
    roc.show()
Logistic Regression model¶# let's make Logistic Regression model
#
lr = LogisticRegression(random_state=11, penalty='l1', solver='liblinear', class_weight=None, C=1.0)
lr.fit(X_train, y_train)
lr_predict = lr.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)  # shuffle=True is required when setting random_state
# # Compute the various perf matrix
perf_matrix_d = computePerfMatrix(lr, lr_predict, kfold=k_fold)
m_perf_matrix["LogisticRegression"] = perf_matrix_d
# Visualize model's performance with yellowbrick library
visualizePerfMatrix(lr)
Decision Tree model¶# Decision tree model with criterion=gini which is default and with random_state=11. Let's change the max_depth=4.
dtc = DecisionTreeClassifier(criterion='gini', random_state=11, max_depth=4)
dtc.fit(X_train, y_train)
# Predict the performance of test samples
dtc_predict = dtc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# # Compute the various perf matrix
perf_matrix_d = computePerfMatrix(dtc, dtc_predict, kfold=k_fold)
m_perf_matrix["DecisionTree"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(dtc)
Random Forest classifier Model¶# Let's build RandomForestClassification model
rfc = RandomForestClassifier(n_estimators=50, criterion='gini', random_state=11)
rfc = rfc.fit(X_train, y_train)
# Predict the performance of the test samples
rfc_predict = rfc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(rfc, rfc_predict, kfold=k_fold)
m_perf_matrix["RandomForest"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(rfc)
Adaboost Classifier Model¶# Let's implement the adaboost emsemble classifier algorithm
abc = AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=11)
abc = abc.fit(X_train, y_train)
# Predict the performance of the tests samples
abc_predict = abc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(abc, abc_predict, kfold=k_fold)
m_perf_matrix["AdaBoost"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(abc)
Bagging Classifier model¶# Let's build bagging model
bgc = BaggingClassifier(n_estimators=50, max_samples=0.7, random_state=11, bootstrap=True, oob_score=True)
bgc = bgc.fit(X_train, y_train)
# Predict the performance of the tests samples
bgc_predict = bgc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(bgc, bgc_predict, kfold=k_fold)
m_perf_matrix["Bagging"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(bgc)
Gradient Boosting Classifier model¶gbc = GradientBoostingClassifier(n_estimators=50, learning_rate = 0.1, random_state=11, max_depth=4)
gbc = gbc.fit(X_train, y_train)
# Predict the performance of the tests samples
gbc_predict = gbc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(gbc, gbc_predict, kfold=k_fold)
m_perf_matrix["GradientBoosting"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(gbc)
Conclusion Summary of the performance of KFold cross validations¶print(m_perf_matrix)
# Change dataframe precision to 4 points
pd.options.display.float_format = '{:,.4f}'.format
# Convert the dictionary to a dataframe
kfold_summary_df = pd.DataFrame.from_dict(m_perf_matrix,orient='index').T
kfold_summary_df
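The dict-of-dicts to DataFrame conversion used above works as in this toy sketch (model names and numbers hypothetical):

```python
import pandas as pd

perf = {"ModelA": {"test_accuracy": 0.90, "f1_score": 0.88},
        "ModelB": {"test_accuracy": 0.85, "f1_score": 0.84}}

# orient='index' makes one row per model; .T puts models in columns, metrics in rows
summary = pd.DataFrame.from_dict(perf, orient='index').T
print(summary)
print(summary.shape)  # (2, 2): 2 metrics x 2 models
```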
Noted the following from the above table:¶RandomForest and Bagging are the best overall algorithms.
Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks)
# Let's create an empty dictionary to store model name and performance matrix
random_search_perf_matrix = dict()
Logistic Regression model with Random Search¶# Specify parameters and distributions to sample from
param_dist = {
"penalty" : ['l1', 'l2'],
"C": [0.1, 0.15, 0.50, 0.75, 0.90, 1.0],
"solver" : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
"dual" : [True, False],
"fit_intercept" : [True, False]
}
# Number of random samples
samples = 11
random_search_lr = RandomizedSearchCV(lr, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold
random_search_lr.fit(csp_df_X, csp_df_Y)
print(random_search_lr.best_params_)
random_search_perf_matrix["random_search_lr"] = random_search_lr.best_params_
Decision Tree model with Random Search¶# Specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
"max_features": [1, 3, 8],
"min_samples_split": [2, 3, 8],
"min_samples_leaf": [1, 3, 8],
#"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
# Number of random samples
samples = 11
random_search_dtc = RandomizedSearchCV(dtc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold
random_search_dtc.fit(csp_df_X, csp_df_Y)
print(random_search_dtc.best_params_)
random_search_perf_matrix["random_search_dtc"] = random_search_dtc.best_params_
Random Forest classifier Model with Random Search¶# Specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
"max_features": [1, 3, 8],
"min_samples_split": [2, 3, 8],
"min_samples_leaf": [1, 3, 8],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
# Number of random samples
samples = 11
random_search_rfc = RandomizedSearchCV(rfc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold
random_search_rfc.fit(csp_df_X, csp_df_Y)
print(random_search_rfc.best_params_)
random_search_perf_matrix["random_search_rfc"] = random_search_rfc.best_params_
Adaboost Classifier Model with Random Search¶# Specify parameters and distributions to sample from
param_dist = {"algorithm" : ['SAMME', 'SAMME.R'],
"learning_rate": [0.1, 0.5, 0.75, 1.0],
}
# Number of random samples
samples = 11
random_search_abc = RandomizedSearchCV(abc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold
random_search_abc.fit(csp_df_X, csp_df_Y)
print(random_search_abc.best_params_)
random_search_perf_matrix["random_search_abc"] = random_search_abc.best_params_
Bagging Classifier model with Random Search¶# Specify parameters and distributions to sample from
param_dist = {"max_features": [1, 3, 8],
"bootstrap": [True, False],
"bootstrap_features" : [True, False],
"oob_score" : [True, False],
}
# Number of random samples
samples = 11
random_search_bgc = RandomizedSearchCV(bgc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold
random_search_bgc.fit(csp_df_X, csp_df_Y)
print(random_search_bgc.best_params_)
random_search_perf_matrix["random_search_bgc"] = random_search_bgc.best_params_
Gradient Boosting Classifier model with Random Search¶# Specify parameters and distributions to sample from
param_dist = {"max_features": [1, 3, 8],
"min_samples_split": [2, 3, 8],
"min_samples_leaf": [1, 3, 8],
"learning_rate" : [0.1, 0.5, 0.75, 1.0],
"loss" : ['deviance', 'exponential'],
"criterion" : ['friedman_mse', 'mse', 'mae']
}
# Number of random samples
samples = 11
random_search_gbc = RandomizedSearchCV(gbc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold
random_search_gbc.fit(csp_df_X, csp_df_Y)
print(random_search_gbc.best_params_)
random_search_perf_matrix["random_search_gbc"] = random_search_gbc.best_params_
Conclusion Summary of the best parameters of Random Search CV with various Algorithms¶print (random_search_perf_matrix)
# Change dataframe precision to 4 points
pd.options.display.float_format = '{:,.4f}'.format
# Convert the dictionary to a dataframe
random_search_summary_df = pd.DataFrame.from_dict(random_search_perf_matrix,orient='index').T
random_search_summary_df
Final Random Forest Model¶# Let's build RandomForestClassification model
rfc = RandomForestClassifier(n_estimators=50, criterion='entropy', random_state=11, min_samples_split=3, min_samples_leaf=1, max_features=3, max_depth=None,bootstrap=False)
rfc = rfc.fit(X_train, y_train)
# Predict the performance of the test samples
rfc_predict = rfc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(rfc, rfc_predict, kfold=k_fold)
m_perf_matrix["RandomForest"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(rfc)
Final Bagging Model¶# Let's build bagging model
bgc = BaggingClassifier(n_estimators=50, max_samples=0.7, random_state=11, oob_score=False, max_features=8, bootstrap_features=True, bootstrap=False)
bgc = bgc.fit(X_train, y_train)
# Predict the performance of the tests samples
bgc_predict = bgc.predict(X_test)
# Let's compute the kfold
k_fold = KFold(n_splits=10, shuffle=True, random_state=11)
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(bgc, bgc_predict, kfold=k_fold)
m_perf_matrix["Bagging"] = perf_matrix_d
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(bgc)
Final Summary:¶print(m_perf_matrix)
# Change dataframe precision to 4 points
pd.options.display.float_format = '{:,.4f}'.format
# Convert the dictionary to a dataframe
kfold_summary_df = pd.DataFrame.from_dict(m_perf_matrix,orient='index').T
kfold_summary_df
If you consider the overall results, the Random Forest is the best model
Performance metrics¶Precision: Fraction of predictions per label that were correctly classified by the model
Recall: Fraction of actuals per label that were correctly classified by the model
F1-score: Harmonic mean of precision and recall. F1-score = 2*(precision x recall)/(precision + recall)
Accuracy: Fraction of all observations that were correctly classified by the model
Macro avg: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account
Micro avg: Calculate metrics globally by counting the total true positives, false negatives and false positives; weighted avg is like macro avg but weights each label by its support
AUC Score: Given a random observation that belongs to a class and a random observation that doesn't, the AUC is the percentage of time that the model will rank which is which correctly
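These definitions can be checked by hand on a toy set of binary predictions (labels hypothetical), without sklearn:

```python
# Toy binary labels: TP=3, FP=1, FN=1, TN=3
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)  # fraction of positive predictions that were right
recall    = tp / (tp + fn)  # fraction of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```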